ABSTRACT

The problem of sorting n elements using p processors in a parallel comparison
model is considered. Lower and upper bounds which imply that for $p \ge n$ the time
complexity of this problem is $\Theta\left(\frac{\log n}{\log(1+p/n)}\right)$ are presented. This complements
[AKS-83] in settling the problem, since the AKS sorting network established that for
$p \le n$ the time complexity is $\Theta\left(\frac{n\log n}{p}\right)$. To prove the lower bounds we show that to
achieve $k \le \log n$ parallel time, we need $\Omega(n^{1+1/k})$ processors.

1.   Introduction 

Perhaps no problem in Computer Science has received more attention than sorting.
[Kn-73], for instance, found that existing computers devote approximately a
quarter of their time to sorting. The advent of parallel computers stimulated intensive
research on sorting with respect to various models of parallel computation. Extensive
lists of references which record this activity are given in [Ak-85], [BHe-86] and [Th-83].

Most of the fastest serial and parallel sorting algorithms are based on binary
comparisons. In these algorithms the number of comparisons is typically the primary
measure of time complexity. Any lower bound on the number of comparisons required for
a problem clearly implies a time lower bound for such algorithms. In the present paper,
we restrict our attention to a parallel comparison model where only comparisons are
counted. In measuring time complexity within this model, we do not count steps in which
communication among the processors, movement of data and memory addressing are
performed. We also avoid counting steps in which consequences are deduced from
comparisons that were performed. Note that our lower bounds apply to all
comparison-based algorithms in any parallel random access machine (PRAM), including
PRAMs which allow simultaneous access to the same common memory location for read and
write purposes. See [BHo-82] for a discussion of a hierarchy of models that implies this.

In  a  serial  decision  tree  model,  we  wish  to  minimize  the  number  of  comparisons.  The 
goal  of  an  algorithm  in  a  parallel  comparison  model  is  to  minimize  the  number  of 
comparison  rounds  as  well  as  the  total  number  of  comparisons  performed. 

Let k stand for the number of comparison rounds (time) of an algorithm in the parallel
comparison model. Given an algorithm, let $u(k,n)$ denote the total (over all rounds of the
algorithm) number of comparisons required by the algorithm to sort any n elements in k
rounds. $u(k,n)$ is the upper bound on the number of comparisons in the worst case. Let
$c(k,n)$ denote the minimum total number of comparisons required to sort any n elements in
k rounds (over all possible algorithms).

The known $\Omega(n\log n)$ comparison lower bound for sorting in a serial decision tree model
implies that, for any k, $c(k,n) = \Omega(n\log n)$. This lower bound can be matched by upper
bounds as follows. For $k = c\log n$, the sorting network of [AKS-83] obtains
$u(k,n) = O(n\log n)$, where $c>0$ is a constant which is implied by the network. For
$k > c\log n$, the result $u(k,n) = O(n\log n)$ also holds. To see this, simply simulate the AKS
network by slowing it down to work in k rounds.

For $k=1$, $c(1,n) = \frac{1}{2}(n^2-n)$. This is since any sorting algorithm which works in one round
must perform all comparisons. Otherwise, suppose that an omitted comparison is between
two successive elements in the sorted order; the algorithm will clearly fail to distinguish
their order. On the other hand, observe that performing all comparisons simultaneously
yields a one-round algorithm in the parallel comparison model that matches this
lower bound exactly, i.e., $u(1,n) = \frac{1}{2}(n^2-n)$.
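The one-round case can be illustrated with a minimal Python sketch. It assumes distinct input values; the function name `one_round_sort` is ours and not from the paper. All $\frac{1}{2}(n^2-n)$ comparisons are independent of each other, so they could be issued in a single round; the rank of each element is then deduced without further comparisons.

```python
from itertools import combinations


def one_round_sort(items):
    """Sort distinct items from a single round of all pairwise comparisons.

    Returns the sorted list and the number of comparisons performed.
    """
    n = len(items)
    smaller_count = [0] * n  # smaller_count[i] = how many elements are below items[i]
    comparisons = 0
    # Every pair is compared exactly once; no comparison depends on another,
    # so in the parallel comparison model they all fit in one round.
    for i, j in combinations(range(n), 2):
        comparisons += 1
        if items[i] > items[j]:
            smaller_count[i] += 1
        else:
            smaller_count[j] += 1
    # With distinct items the counts are a permutation of 0..n-1 (the ranks).
    result = [None] * n
    for i, rank in enumerate(smaller_count):
        result[rank] = items[i]
    return result, comparisons


print(one_round_sort([5, 2, 9, 1]))  # ([1, 2, 5, 9], 6)
```

For n = 4 the sketch performs 6 comparisons, which equals $\frac{1}{2}(4^2-4)$.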

So, it remains to consider the situation for $1 < k \le c\log n$.

We state our main result, which is proved in Section 3:

$$c(k,n) \ge k\left(\frac{n^{1+1/k}}{e} - n\right)$$

for any $k,n \ge 1$, where e is the base of the natural logarithm.

Corollaries of the main result:
Suppose we have p processors with the interpretation that each processor can perform at
most one comparison at each round. Observe that $kp \ge c(k,n)$, or $p \ge c(k,n)/k$. Therefore,

Corollary 1. Any k-round ($k \ge 1$) parallel algorithm for sorting n elements needs
$p \ge \frac{n^{1+1/k}}{e} - n$ processors. This yields $p = \Omega(n^{1+1/k})$ for $k \le c\log n$, where c is any
constant such that $0<c<1$.
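To get a feeling for the magnitude of the bound in Corollary 1, the following Python sketch simply evaluates $n^{1+1/k}/e - n$. The function name `processors_lower_bound` is an illustrative choice of ours. Note that for k close to or above log n the expression becomes negative, in which case the bound says nothing.

```python
import math


def processors_lower_bound(n, k):
    """Evaluate the Corollary 1 expression n**(1 + 1/k) / e - n.

    A negative value means the bound is vacuous for this (n, k).
    """
    return n ** (1.0 + 1.0 / k) / math.e - n


# Example: n = 1024 elements and a few round counts.
for k in (1, 2, 4, 8, 20):
    print(k, round(processors_lower_bound(1024, k), 1))
```

For n = 1024 and k = 2 the expression is roughly 11030, while for k = 20 it is already negative, which matches the restriction $k \le c\log n$ with $c<1$ in the asymptotic statement.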

Corollary 2. The number of rounds required to sort n elements using $p \ge n$ processors is
$k = \Omega\left(\frac{\log n}{\log(1+p/n)}\right)$.

Throughout the paper, log refers to the natural logarithm.
Proof. $p \ge \frac{n^{1+1/k}}{e} - n$ implies $1 + \frac{p}{n} \ge \frac{n^{1/k}}{e}$ and therefore $k \ge \frac{\log n}{1+\log(1+p/n)}$. Hence, for
$p \ge n$, $k = \Omega\left(\frac{\log n}{\log(1+p/n)}\right)$.
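The explicit (non-asymptotic) form of this bound, $k \ge \log n / (1+\log(1+p/n))$, can be evaluated directly. The Python sketch below does so; the function name `rounds_lower_bound` is hypothetical, and natural logarithms are used as in the paper.

```python
import math


def rounds_lower_bound(n, p):
    """Explicit bound from the proof of Corollary 2: log n / (1 + log(1 + p/n))."""
    return math.log(n) / (1.0 + math.log(1.0 + p / n))


n, p = 2 ** 20, 2 ** 24
print(rounds_lower_bound(n, p))  # about 3.62, so at least 4 rounds are needed
```

As a consistency check with Corollary 1: any integer k below this value would require more than p processors, since $k < \log n/(1+\log(1+p/n))$ is equivalent to $n^{1+1/k}/e - n > p$.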

Corollary 3. If $p = n\log^{\beta} n$ for $\beta>0$ then the number of rounds required to sort n elements
is $k = \Omega\left(\frac{\log n}{\beta\log\log n}\right)$. This is an immediate corollary of Corollary 2.

A parallel algorithm is said to achieve optimal speed up if its running time is proportional
to $\frac{Seq(n)}{p}$, where Seq(n) is a lower bound on the serial running time, n is the size of the
problem being considered and p is the number of processors used.

Corollary 4. If the number of processors is larger than n by an order of magnitude then it
is impossible to design an optimal speed up comparison sorting algorithm. More formally,
suppose that the number of processors p is not O(n) (i.e., n = o(p)); then there is no
(comparison) sorting algorithm which runs in time $O\left(\frac{n\log n}{p}\right)$.

Section 4 presents upper bounds which match these new lower bounds. Specifically,
we describe a parallel comparison algorithm that sorts n elements in $O\left(\frac{\log n}{\log(1+p/n)}\right)$
rounds using $p \ge n$ processors.

To understand better the significance of the lower and upper bounds of the present paper
we will use one more equivalent formulation of the results.

Corollary 5. Suppose we are given $p \ge n$ processors to sort n elements. The total number of
comparisons performed by the fastest parallel sorting algorithm is

$$\Theta\left(n\log n \cdot \frac{p/n}{\log(1+p/n)}\right).$$

The factor $n\log n$ represents the serial lower and upper bounds for sorting using
comparisons. The other factor represents the deviation from optimal speed up.
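The two factors of Corollary 5 can be tabulated with a short Python sketch. The names `deviation_factor` and `total_comparisons_order` are ours, and constant factors hidden by the $\Theta$ notation are ignored, so the numbers only indicate growth.

```python
import math


def deviation_factor(n, p):
    """The factor (p/n) / log(1 + p/n) that measures deviation from optimal speed up."""
    ratio = p / n
    return ratio / math.log(1.0 + ratio)


def total_comparisons_order(n, p):
    """n log n times the deviation factor (constants of the Theta bound omitted)."""
    return n * math.log(n) * deviation_factor(n, p)


n = 1 << 16
for p in (n, 16 * n, 256 * n):
    print(p // n, round(deviation_factor(n, p), 2))
```

The product equals $p \cdot \log n / \log(1+p/n)$, i.e. the number of processors times the number of rounds, and the deviation factor grows with p/n.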
In studying the limit of parallel algorithms it is interesting to identify asymptotically
the minimal time k that can be achieved by an optimal speed up algorithm. We call this
minimal time the parallelism break point of the problem being considered. [Va-75] proved
that $\Theta(\log\log n)$ is the break point for finding the maximum among n elements. [BHo-82]
gave a lower bound and [Kr-83] an upper bound to prove that $\Theta(\log\log n)$ is the break
point for merging two sorted lists, where n is the length of each list. The above two lower
bounds were also obtained in a parallel comparison model (which is therefore often
referred to as Valiant's model). The present paper makes it possible to add sorting to the
list of problems for which the break point has been identified. Specifically, Corollary 4
complements the sorting network of [AKS-83] in proving that $\Theta(\log n)$ is the break point
for sorting n elements. It is interesting to compare the "pattern" in which the break point
occurs in these three problems. The elegant lower bound proofs of Valiant and
Borodin-Hopcroft show that $\Omega(\log\log n)$ rounds are required if n processors are used for the
problems of finding the maximum and merging, respectively. The algorithms of Valiant and
Kruskal run in $O(\log\log n)$ rounds using $\frac{n}{\log\log n}$ processors for each of these problems,
respectively. This isolates distinctly the break points for these two problems, since the
asymptotic time bound cannot improve by increasing the number of processors from
$\frac{n}{\log\log n}$ to n. On the other hand, such degenerate isolation does not occur in the sorting
problem. Specifically, Corollary 5 implies that increasing the number of processors
asymptotically always yields an asymptotic decrease in the number of comparison rounds.

We note that we have proved the first nontrivial lower bound in a parallel comparison
model for log n time. The problem solved in this paper was open for some time.
Interestingly, our proof is relatively simple and based only on elementary methods from
discrete and continuous mathematics.

Let us review works on sorting n elements in a parallel comparison model. Haggkvist
and Hell [HH-81] proved that if k, the number of rounds, is constant, then $\Omega(n^{1+1/k})$
processors are required to sort n elements. Using random graphs and a certain probability
space, Bollobas and Thomason [BT-83] proved that almost every algorithm that uses
$p = O(n^{3/2}\log n)$ processors sorts n elements in two rounds. Bollobas [Bo-86] iterated the
random graphs techniques to prove that n elements can be sorted in a constant number of
rounds k using $O(n^{1+1/k}\log n)$ comparisons. This almost matches the Haggkvist-Hell lower
bound. Remark. Conversely, these results imply that for $p = O(n^{1+\epsilon})$ processors, it is
impossible to sort in less than $k = 1/\epsilon$ rounds, but we can sort in $k = 1/\epsilon + 1$ rounds. So these
upper and lower bounds are at most one round apart when k is constant.

However,  a  closer  look  at  this  lower  bound  of  Haggkvist  and  Hell  reveals  the 
following.   They  actually   proved   that     if  k,  the  number   of  rounds,   is   a   variable,   then 

p  >  — ^Zl~    "TT  processors  are  required  to  sort  n  elements.    We  compare  this  result  with 

Corollary 1 which was given above. Observe that their proof implies that $p = \Omega(n^{1+1/k})$
only when k is constant. Even for constant k which is not very large this asymptotic lower
bound contains a very small constant factor (for instance, if k=100 then $n^{1+1/k}$ is multiplied
by less than $2^{-100}$). Moreover, their result becomes trivial for $k \ge \sqrt{\log n}$. This is since
for this range their result implies an asymptotic bound which is O(n) for the number of
processors p, as can be readily verified. On the other hand, Corollary 1 states that
$p \ge n^{1+1/k}/e - n$ for every k. As was indicated above, this implies that $p = \Omega(n^{1+1/k})$ for
any $k \le c\log n$, where $0<c<1$ is a constant.

We note two additional papers whose titles are related to the title of the present
paper. [Le-84] proposed an adaptation of the AKS network to bounded degree n-node
networks. [MW-85] gave a $\sqrt{\log n}$ lower bound for parallel sorting by n processors in some
variant of PRAM. Their model is not comparable to the parallel comparison model
considered here. The trivial log n lower bound for parallel sorting by n processors in the
parallel comparison model does not allow non comparison algorithms like bucket sort. On
the other hand, ranking an element among n other elements can be done in one round of
comparisons using n processors in the parallel comparison model, while their PRAM seems
to require non constant time using n processors.

2.   The  Parallel  Comparison  Model 

Let V be a set of n elements taken from a totally ordered domain. The parallel
comparison model of computation allows algorithms that work as follows. The algorithm
consists of time steps called rounds. In each round binary comparisons are performed
simultaneously. The inputs for each comparison are two elements of V. The output of each
comparison is one of the following two: < or >. Note that we do not allow equality
between two elements of V. This can be done without loss of generality, since we define

Ultracomputer Note 103


the  order  between  two  equal  input  elements  to  be  the  order  of  their  indices.  Each  item 
may  take  part  in  several  comparisons  during  the  same  round. 

Remark. Our discussion uses the following correspondence between each round and a
graph. The elements are the vertices. Each comparison to be performed is an undirected
edge which connects its input elements. Each comparison results in orienting this edge
from the larger element to the smaller.
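The round-to-graph correspondence of the Remark can be sketched in a few lines of Python. The representation (a list of values indexed by vertex number and a list of index pairs as edges) and the name `run_round` are assumptions made for illustration. Ties are broken by index, following the convention stated above for equal input elements.

```python
def run_round(values, edges):
    """Orient each undirected edge (i, j) from the larger element to the smaller.

    values: list of elements, indexed by vertex number.
    edges:  list of (i, j) index pairs, the comparisons of one round.
    Equal values are ordered by their indices, so no comparison yields equality.
    """
    oriented = []
    for i, j in edges:
        # Compare (value, index) pairs so that ties are resolved by index.
        if (values[i], i) > (values[j], j):
            oriented.append((i, j))
        else:
            oriented.append((j, i))
    return oriented


print(run_round([3, 1, 2], [(0, 1), (1, 2)]))  # [(0, 1), (2, 1)]
```

Each returned pair (a, b) reads "element a is larger than element b". Whatever is deduced from these oriented edges afterwards costs nothing in the model.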

Suppose  we  performed  r  rounds  where  r>0  is  some  integer.  Consider  any  function  of  V 
that  can  be  computed  using  the  comparisons  performed  in  these  r  rounds  without  any 
further  comparisons  of  elements  in  V.  Our  model  defines  such  a  function  to  be  computable 
following  round  r.  Note  that  this  definition  suppresses  all  computational  steps  that  do  not 
involve  comparisons  of  elements  in  V.  Which  comparisons  to  perform  at  round  r+1  and 
the  input  for  each  such  comparison  should  be  functions  which  are  computable  following 
round  r.  We  are  interested  in  sorting  the  elements  in  V  from  the  smallest  to  the  largest  in  k 
rounds,  where  the  integer  k  can  be  either  constant  or  a  function  of  n. 

Recall  that  c(k,n)  denotes  the  minimum  total  number  of  comparisons  required  to  sort 
any  n  elements  in  k  rounds  (over  all  possible  algorithms). 

3.   The  Lower  Bound 

The Main Theorem: $c(k,n) \ge k\left(\frac{n^{1+1/k}}{e} - n\right)$ for any $k,n \ge 1$, where e is the base of
the natural logarithm.

Proof of the Main Theorem: By induction on k and n.

The base of the induction. For $k=1$ and every $n \ge 1$, clearly
$c(1,n) = \frac{n^2-n}{2} \ge \frac{n^2}{e} - n$. For $n = 1,2$ and every $k \ge 1$,
$c(k,1) \ge 0 \ge k\left(\frac{1}{e}-1\right)$ and $c(k,2) \ge 0 \ge k\left(\frac{4}{e}-2\right) \ge k\left(\frac{2^{1+1/k}}{e}-2\right)$.

The inductive assumption: Given k,n, if $k' \le k$ and $n' < n$, or $k' < k$ and $n' \le n$, then

$$c(k',n') \ge k'\left(\frac{n'^{\,1+1/k'}}{e} - n'\right).$$

Take any k-round algorithm for sorting a set V of n elements. The first round of the
algorithm consists of some set E of comparisons. Recall that we look at them as edges in
the graph G = (V,E). An independent set in G is a subset of vertices from V such that no
two vertices are adjacent by an edge in E. An independent set is maximal if it is not a
proper subset of another independent set. Consider the graph of the first round of
comparisons. Let S be a maximal independent set in this graph and denote $x = |S|$. Each of
the $n-x$ elements of $\bar{S} = V - S$ must share an edge with an element of S, otherwise S is not
maximal. For our lower bound proof, we restrict our attention to linear orders on V in
which each element of $\bar{S}$ is greater than each element of S. For any of these orders it is
impossible to obtain any information regarding the relation between two elements of S or
two elements of $\bar{S}$ using comparisons between an element of S and an element of $\bar{S}$.
Moreover, since S is independent, no two elements of S are compared in the first round, so
S must be sorted in the remaining $k-1$ rounds.
Therefore, aside from these $n-x$ comparisons, there must be at least $c(k-1,x)$
comparisons to sort S and at least $c(k,n-x)$ comparisons to sort $\bar{S}$. This implies the
following recursive inequality:

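$$c(k,n) \ge c(k,n-x) + (n-x) + c(k-1,x)$$

By the inductive assumption

$$\ge k\left(\frac{(n-x)^{1+1/k}}{e} - (n-x)\right) + (n-x) + (k-1)\left(\frac{x^{1+1/(k-1)}}{e} - x\right)$$

By opening parentheses and permuting terms we get

$$= \frac{k}{e}(n-x)^{1+1/k} + \frac{k-1}{e}\,x^{1+1/(k-1)} + n - kn$$

$$= \frac{k}{e}\,n^{1+1/k}\left[\left(1-\frac{x}{n}\right)^{1+1/k} + \left(1-\frac{1}{k}\right)\frac{x^{1+1/(k-1)}}{n^{1+1/k}} + \frac{1}{k}\cdot\frac{e}{n^{1/k}}\right] - kn$$

Recall the Geometric Arithmetic Mean Inequality $\alpha a + \beta b \ge a^{\alpha}b^{\beta}$, where
$\alpha+\beta=1$ and $\alpha,\beta,a,b \ge 0$. By taking $\alpha = 1-\frac{1}{k}$, $\beta = \frac{1}{k}$, $a = \frac{x^{1+1/(k-1)}}{n^{1+1/k}}$, $b = \frac{e}{n^{1/k}}$ we get

$$\ge \frac{k}{e}\,n^{1+1/k}\left[\left(1-\frac{x}{n}\right)^{1+1/k} + \frac{x\,e^{1/k}}{n^{1-1/k^2}\,n^{1/k^2}}\right] - kn
= \frac{k}{e}\,n^{1+1/k}\left[\left(1-\frac{x}{n}\right)^{1+1/k} + \frac{x}{n}\,e^{1/k}\right] - kn$$

Recall that the increasing sequence $\left(1+\frac{1}{k}\right)^k$ converges to e and therefore $e^{1/k} \ge 1+\frac{1}{k}$.
This implies

$$\ge \frac{k}{e}\,n^{1+1/k}\left[\left(1-\frac{x}{n}\right)^{1+1/k} + \frac{x}{n}\left(1+\frac{1}{k}\right)\right] - kn$$

Recall Bernoulli's Inequality: $(1-a)^r \ge 1-ar$ for $r \ge 1$, $a \le 1$. This implies

$$\ge \frac{k}{e}\,n^{1+1/k}\left[1 - \left(1+\frac{1}{k}\right)\frac{x}{n} + \frac{x}{n}\left(1+\frac{1}{k}\right)\right] - kn
= \frac{k}{e}\,n^{1+1/k} - kn = k\left(\frac{n^{1+1/k}}{e} - n\right).$$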
This completes the proof of the Main Theorem.
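The two ingredients of the proof can be checked on small instances with the following Python sketch. It is only a numerical sanity check under our own assumptions, not part of the proof: `maximal_independent_set` uses a simple greedy scan (any maximal independent set would do), `bound` evaluates $k(m^{1+1/k}/e - m)$, and `inductive_step_holds` tests the inequality $bound(k,n-x) + (n-x) + bound(k-1,x) \ge bound(k,n)$ for every integer x with $1 \le x \le n$ and $k \ge 2$. All three names are hypothetical.

```python
import math


def maximal_independent_set(n, edges):
    """Greedy maximal independent set of a graph on vertices 0..n-1."""
    adjacent = [set() for _ in range(n)]
    for u, v in edges:
        adjacent[u].add(v)
        adjacent[v].add(u)
    chosen = set()
    for v in range(n):
        if not (adjacent[v] & chosen):  # no chosen neighbour, so v can join S
            chosen.add(v)
    return chosen


def bound(k, m):
    """The Main Theorem expression k * (m**(1 + 1/k) / e - m), with bound(k, 0) = 0."""
    if m == 0:
        return 0.0
    return k * (m ** (1.0 + 1.0 / k) / math.e - m)


def inductive_step_holds(k, n, tol=1e-9):
    """Check the inductive step for k >= 2 and every x in 1..n (floating point)."""
    target = bound(k, n)
    return all(
        bound(k, n - x) + (n - x) + bound(k - 1, x) >= target - tol
        for x in range(1, n + 1)
    )


# First round of a hypothetical algorithm: comparisons along a path 0-1-2-3, vertex 4 idle.
first_round = [(0, 1), (1, 2), (2, 3)]
S = maximal_independent_set(5, first_round)
print(sorted(S))  # [0, 2, 4]
print(all(inductive_step_holds(k, n) for k in range(2, 7) for n in range(1, 61)))
```

In the example every vertex outside S (vertices 1 and 3) has a neighbour in S, as maximality requires. The case k = 1 is excluded from the check because it is handled by the base of the induction.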

4.   Upper  Bound 

Theorem (due to Noga Alon): Given n elements from a totally ordered domain,
there is an algorithm in a parallel comparison model for sorting these elements in
$O\left(\frac{\log n}{\log(1+p/n)}\right)$ rounds using $p \ge n$ processors.

Proof. First recall the AKS comparison network. It sorts n elements in $O(\log n)$ rounds
using $p = n/2$ processors (i.e., comparisons in each round). We give an algorithm in a
parallel comparison model. Each round of the new algorithm is called a superround. The
algorithm is derived from the AKS network by simply shrinking $\delta = 0.5\log(1+p/n)$ rounds of
this network into one superround.

The construction of the algorithm is based on the following idea. We aim that the
following Assertion will hold.

Assertion. After superround r, the following things are available: (1) The pair of input
elements for each comparison performed in the first $\delta r$ rounds of the AKS network. (2) The
result of each such comparison.

This Assertion implies that after $O\left(\frac{\log n}{\log(1+p/n)}\right)$ superrounds the results of all
comparisons of the AKS network are available and the sorting is completed (since it is
computable).

We show how to satisfy the Assertion for any superround r. For r = 0 the Assertion
trivially holds. We show how to satisfy the Assertion for superround r assuming that it is
satisfied for any superround < r.

The fact that we relate to a comparison network implies that each element which is
compared in round $\delta(r-1)+i$, where $1 \le i \le \delta$, is one of at most $2^{i-1}$ elements which are
outputs of comparisons of the first $\delta(r-1)$ rounds (or input elements). By the inductive
assumption, each of these outputs is available following superround $r-1$. Therefore, each
comparison in round $\delta(r-1)+i$ is actually one of $(2^{i-1})^2$ possible pairs. All we do is
perform all these possible comparisons simultaneously (for $1 \le i \le \delta$). These comparisons
clearly include the actual comparisons performed by the AKS network in these rounds. It
remains to show that this construction also yields the pairs of input elements to each
comparison which was actually performed in each of these rounds. For this we show by
simple induction that the actual pair of each comparison, as well as its result, are available
for all rounds $\le \delta(r-1)+i$, $0 \le i \le \delta$. For $i=0$, this follows from the inductive
assumption of the Assertion for $r-1$. Suppose that for all rounds $< \delta(r-1)+i$ the actual
pairs compared, as well as their results, are available. Each element participating in round
$\delta(r-1)+i$ is an outcome of the actual comparisons of preceding rounds and their results.
They are known by the inductive assumption. Therefore, the input pair for each such
comparison is known. We already argued that the result of each such comparison was
found by our algorithm. This completes the proof of the induction for i. Taking $i=\delta$, we
complete the inductive proof of the Assertion.


The number of comparisons that the algorithm has to perform in each superround is:

$$\frac{n}{2}\sum_{i=1}^{\delta}\left(2^{i-1}\right)^2 \le \frac{n}{2}\,2^{2\delta}.$$

But $\delta = 0.5\log(1+p/n)$, and therefore this number of comparisons is not more than

$$\frac{n}{2}\,2^{\log(1+p/n)} \le \frac{n}{2}\left(1+\frac{p}{n}\right) = \frac{n+p}{2} \le p.$$

So there are enough processors to perform all these comparisons.
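The counting argument can be checked numerically with a small Python sketch. Several choices here are assumptions of ours and not from the paper: $\delta$ is rounded down to an integer and forced to be at least 1, the natural logarithm is used as elsewhere in the paper, each AKS round is taken to have n/2 comparators, and the name `superround_plan` is hypothetical.

```python
import math


def superround_plan(n, p):
    """Return (delta, comparisons per superround) for n elements and p >= n processors.

    delta AKS rounds are shrunk into one superround. For the comparator of
    round i inside the superround (1 <= i <= delta) all (2**(i-1))**2 candidate
    pairs are compared at once.
    """
    if p < n:
        raise ValueError("the construction assumes p >= n")
    # Assumption: integer delta, rounded down, but at least one AKS round per superround.
    delta = max(1, math.floor(0.5 * math.log(1.0 + p / n)))
    comparisons = (n // 2) * sum((2 ** (i - 1)) ** 2 for i in range(1, delta + 1))
    return delta, comparisons


print(superround_plan(1024, 1024 * 1024))  # (3, 10752)
```

For n = 1024 and p = 1024 * 1024 the sketch packs 3 AKS rounds into each superround and needs 10752 simultaneous comparisons, well within the p available processors.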